Path-conditioned training: a principled way to rescale ReLU neural networks
Arthur Lebeurrier, Titouan Vayer, Rémi Gribonval
Despite recent algorithmic advances, we still lack principled ways to leverage the well-documented rescaling symmetries in ReLU neural network parameters. While two properly rescaled sets of parameters implement the same function, their training dynamics can be dramatically different. To offer a fresh perspective on exploiting this phenomenon, we build on the recent path-lifting framework, which provides a compact factorization of ReLU networks. We introduce a geometrically motivated criterion for rescaling neural network parameters whose minimization leads to a conditioning strategy that aligns a kernel in the path-lifting space with a chosen reference. We derive an efficient algorithm to perform this alignment. In the setting of random network initialization, we analyze how the architecture and the initialization scale jointly impact the output of the proposed method. Numerical experiments illustrate its potential to speed up training.
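The rescaling symmetry invoked above follows from the positive homogeneity of ReLU: multiplying the incoming weights (and bias) of a hidden neuron by some lambda > 0 and dividing its outgoing weights by the same lambda leaves the network function unchanged. The following is a minimal NumPy sketch of this invariance on a one-hidden-layer network; the layer sizes, seed, and `forward` helper are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer ReLU network: f(x) = W2 @ relu(W1 @ x + b1) + b2
d_in, d_hidden, d_out = 3, 5, 2
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)

def forward(W1, b1, W2, b2, x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# Per-neuron rescaling: scale each hidden neuron's incoming weights and bias
# by lam > 0 and divide its outgoing weights by lam.  Since
# relu(lam * z) = lam * relu(z) for lam > 0, the network function is unchanged.
lam = rng.uniform(0.1, 10.0, size=d_hidden)
W1r, b1r = lam[:, None] * W1, lam * b1
W2r = W2 / lam[None, :]

x = rng.normal(size=d_in)
print(np.allclose(forward(W1, b1, W2, b2, x),
                  forward(W1r, b1r, W2r, b2, x)))  # True
```

The same per-neuron rescaling can be applied independently at every hidden layer of a deeper ReLU network; it is this family of symmetries that the proposed path-lifting-based conditioning is designed to exploit.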
Inductive biases of multi-task learning and finetuning: multiple regimes of feature reuse
Neural networks are often trained on multiple tasks, either simultaneously (multi-task learning, MTL) or sequentially (pretraining and subsequent finetuning, PT+FT). In particular, it is common practice to pretrain neural networks on a large auxiliary task before finetuning on a downstream task with fewer samples. Despite the prevalence of this approach, the inductive biases that arise from learning multiple tasks are poorly characterized. In this work, we address this gap.
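As a concrete illustration of the PT+FT regime described above, here is a minimal PyTorch sketch: pretrain a shared backbone on a large auxiliary task, then reuse the backbone with a fresh head on a small downstream task. The toy architecture, data shapes, and hyperparameters are arbitrary illustrative choices, not the paper's experimental setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Shared feature extractor with task-specific heads (hypothetical toy setup).
backbone = nn.Sequential(nn.Linear(10, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU())
head_aux = nn.Linear(64, 5)    # head for the large auxiliary (pretraining) task
head_down = nn.Linear(64, 2)   # head for the small downstream task

def make_data(n, num_classes):
    return torch.randn(n, 10), torch.randint(num_classes, (n,))

loss_fn = nn.CrossEntropyLoss()

# --- Pretraining on the auxiliary task (many samples) ---
x_aux, y_aux = make_data(2048, 5)
opt = torch.optim.Adam(list(backbone.parameters()) + list(head_aux.parameters()), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss_fn(head_aux(backbone(x_aux)), y_aux).backward()
    opt.step()

# --- Finetuning on the downstream task (few samples), reusing the backbone ---
x_dn, y_dn = make_data(64, 2)
opt = torch.optim.Adam(list(backbone.parameters()) + list(head_down.parameters()), lr=1e-4)
for _ in range(100):
    opt.zero_grad()
    loss_fn(head_down(backbone(x_dn)), y_dn).backward()
    opt.step()
```

In the alternative MTL regime, both heads would instead be trained jointly on a summed loss over the two tasks; the paper studies the inductive biases that distinguish these two ways of sharing features.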
A Proof of Proposition 2.5
Proposition 2.5 is a direct consequence of Lemma A.1 (smooth functions conserved through a given flow). The proof first establishes the direct inclusion, then the converse inclusion, recalling (cf. Examples 2.10 and 2.11) the linear case and Assumption 2.9, and finally shows, via Theorem 2.14, that (9) holds for standard ML losses.